ABSTRACT

Mining Software Repositories (MSR) is an applied and
practice-oriented field aimed at solving real problems encountered
by practitioners and bringing value to Industry. Replication of
results and findings, generalizability and external validity,
University-Industry collaboration, data sharing and the creation of
dataset repositories are important issues in MSR research.
Bibliometric analysis of MSR papers shows a lack of
University-Industry collaboration, a deficiency of studies on closed
or proprietary source datasets and a lack of data as well as tool
sharing by researchers. We conduct a survey of authors of the past
three years of the MSR conference (2012, 2013 and 2014) to collect
data on their views and suggestions to address the stated concerns.
We asked 20 questions of more than 100 authors and received
responses from 39 authors. Our results show that about one-third of
the respondents always make their dataset publicly available and
about one-third believe that data sharing should be a mandatory
condition for publication in MSR conferences. Our survey reveals
that more than 50% of the authors used solely open-source software
(OSS) datasets for their research. More than 50% of the respondents
mentioned that difficulty in sharing Industrial datasets outside the
company is one of the major impediments in University-Industry
collaboration.

1. RESEARCH MOTIVATION AND AIM

Mining Software Repositories (MSR) is one of the fastest growing
fields and communities within Software Engineering and consists of
analysing the rich data available in software repositories to
uncover interesting and actionable information about software
systems and projects [1][2][3]. MSR is data-driven and falls under
Empirical Software Engineering (ESE). MSR is an applied and
practice-oriented field aimed at solving real problems encountered
by practitioners and bringing value to Industry. Due to the nature
of the discipline and its objectives, several factors are crucial in
MSR research: reproducibility or replication of findings and
results, generalizability of an approach to other datasets, data
sharing by researchers and University-Industry collaboration.
Tripathi et al. conduct a bibliometric analysis of the past five
years of research papers published in the MSR series of conferences
(2010-2014) and show that out of 187 studies over a period of 5
years, 90.9% of the studies are conducted solely on OSS datasets
[14]. Their findings indicate that only 14.43% of the studies
involve a University-Industry collaboration [14].


Table 1: Work Profile of Survey Respondents

[1] Are you currently working in an Industry or University?
    Industry: 26.32%
    University: 73.68%

[2] What is your current job role within Industry or University?
    Masters Student: 5.26%
    PhD Scholar: 26.32%
    Professor: 39.47%
    Researcher in Industry: 13.16%
    Software Engineer in Industry: 7.89%
    Manager in Industry: 7.89%


The study presented in this paper is motivated by the need to gain a
deeper understanding of the stated issues and their solutions by
conducting a survey of authors who have published papers in the MSR
conference. We conduct a survey consisting of 20 questions of
authors who have published papers in MSR 2012, 2013 and 2014. We
limit the scope of our analysis to the MSR series of conferences
over the last three years. MSR research papers are also published in
several other Software Engineering conferences. However, selection
of conferences and identification of MSR papers in such conferences
by the authors can result in a selection bias. We eliminate this
selection bias by analysing publications only from the MSR
conference. While there have been bibliometric studies on MSR papers
[10][14] on the topics of replication, data sharing and
University-Industry collaboration, the work presented in this paper
is the first study involving a survey of MSR authors.

2. SURVEY QUESTIONNAIRE AND FINDINGS

We sent a survey consisting of 20 questions to all authors of the
MSR 2012, 2013 and 2014 conferences and received a total of 39
responses. We did not ask for the name or any personally
identifiable information of the author. We have made the survey
responses publicly available as an Excel file (see footnote 1) so
that other interested researchers can do analysis in addition to the
findings presented in this paper. Table 1, Table 2 and Figure 1
display information on the work profile of the survey respondents.
Table 1 reveals that nearly 75% (73.68%) of the survey respondents
were affiliated to a University whereas only about 25% (26.32%) of
the respondents were from Industry. Table 2 and Figure 1 show the
distribution of the survey respondents across roles and job
profiles. We received opinions from MS and PhD Scholars in
University, Faculty Members, and Researchers, Software Engineers and
Managers in Industry. While the percentage of Software Engineers and
Managers (non-research roles) in Industry is small, they are still
represented.


1 http://bit.ly/lCXOV3r











Table 2: Work Experience in MSR and Authorship in MSR Conference

[3] How many years of experience do you have in Mining Software
Repository (MSR) research?
    0-2 years: 31.58%
    2-5 years: 28.95%
    More than 5 years: 39.47%

[4] How many publications (excluding data challenge track) you have
in Mining Software Repository (MSR) series of conferences?
    0-2: 63.16%
    3-5: 18.42%
    More than 6: 18.42%

[5] How many Mining Software Repository (MSR) series of conferences
have you attended?
    0-2: 65.79%
    3-5: 18.42%
    More than 6: 15.79%



Figure 1: Distribution of Survey Respondents across Job Roles
(legend: Masters Student, PhD Scholar, Professor, Researcher in
Industry, Software Engineer in Industry, Manager in Industry)

Figure 2: Survey Respondents Opinion on Enabling
University-Industry Collaboration

2.1 University-Industry Collaboration

University-Industry collaboration in Software Engineering (and
particularly Empirical Software Engineering) is an area that has
attracted the attention of several researchers. There are both
benefits and challenges associated with such collaboration. Runeson
et al. present their experiences of a 10-year Industry-Academia
collaboration program. Their study focuses on the time-horizon
aspects of Industry-Academia collaboration and reveals that Industry
time horizons are generally shorter than the academic perspective,
which poses challenges to the collaboration [12].
Martinez-Fernandez et al. present their practical experiences in
designing and conducting empirical studies involving
Industry-Academia collaboration. The focus of the collaboration
described in their study is on Software Reference Architecture (SRA)
projects in an IT consulting and services organization. The authors
mention the acquisition of realistic sources of data as well as the
creation of repeatable techniques and results as some of the major
research challenges [9].

Enoiu et al. present an empirical exploration of enablers and
impediments for collaborative research in Software Testing. They
list open sharing of information for research purposes, use of a
dedicated tooling platform and creation of a research culture as
major enablers. They mention resistance to change, lack of knowledge
of techniques and tools evaluated in academia and not assuming a
stable research focus for the conduct of relevant experiments as the
three main impediments [7]. Runeson et al. present their experiences
in a 2-year University-Industry collaboration project on software
testing which involved on-site work by the researcher on the
industry premises. They mention several factors which influence a
successful collaboration: company management support, a champion at
the company, the researcher's attitude and social skills and the
researcher's commitment to focus on industry needs [11]. Wohlin et
al. present a list of top 10 challenges (such as trust and respect,
champion, social skills, commitment to company needs) of working
with industry, based on their experience of working with industry in
a very close collaboration with continuous exchange of knowledge and
information [15].


Figure 3: Survey Respondents Opinion on Data Sharing
as a Mandatory Condition for Publication

We asked four questions related to University-Industry collaboration
to MSR authors. The questions on University-Industry collaboration
were optional (since not everyone would have engaged in such a
collaboration) and were answered by nearly 65% of the survey
respondents. Table 3 reveals that 12% of the engagements were
failures, 16% were successful and the remaining 72% were partially
successful. Table 3 and Figure 2 show respondents' opinions on how
to improve the collaboration between Industry and Academia.
Respondents could select multiple options for this question. Figure
2 reveals that Industry sponsored PhD fellowships and encouraging
student internships are enablers for improving the partnership.
Difficulty in sharing Industrial data outside the company was
selected as one of the major challenges and impediments encountered
by the researchers in University-Industry collaboration.


Table 3: University-Industry Collaboration Success, Duration, Challenges and Suggestions

[6] Whether the University-Industry collaboration study was a success or a failure?
    Failure: 12%
    Partially Successful: 72%
    Successful: 16%

[7] What is the average duration (in years) of your University-Industry Collaboration study?
    0-1 years: 52%
    1-2 years: 44%
    More than 2 years: 4%

[8] What challenges you faced during the Collaboration?
    Difficulty in sharing Industrial dataset outside the company: 52%
    Difference in focus on goal (business impact in Industry vs. scholarly impact in Academia): 44%
    Different timeline (project deliverable timeline does not match academic milestones): 4%

[9] Can you suggest some ways to improve the collaboration?
    Industry sponsored PhD fellowships: 60%
    Encouraging student Internships: 64%
    Academic projects to be inclined towards Industry problems: 52%
    Others (please specify): 12%


Table 4: Data Sharing and Public Repositories

[10] In MSR research have you ever made your dataset publicly available?
    Never: 4.55%
    Rarely: 0%
    Sometimes: 59.09%
    Always: 36.36%

[11] Is your dataset freely available or do we need to request for permissions to use it?
    Freely, in the public domain: 100%
    With a dataset sharing agreement or license: 0%
    With a service fee for use (by industry) to help maintain the dataset: 0%

[12] Which platform did you use to share your dataset?
    Home Page: 63.64%
    GitHub: 45.45%
    Bitbucket: 9.09%
    Submitted to existing repository (e.g. PROMISE Repository): 4.55%
    Others (please specify): 13.64%

[13] Why do you want to share your dataset in MSR?
    To contribute to the replication of experiments: 100%
    To allow meta-analysis (combining the findings from independent studies): 45.45%
    To get more Empirical Software Engineering researchers involved in the MSR research: 68.18%
    Others (please specify): 4.55%

[14] Should dataset sharing be a mandatory condition for publication in MSR conference?
    No: 9.09%
    Neutral: 59.09%
    Yes: 31.82%


Table 5: OSS/CSS Dataset, Generalizability of Findings and Threats to External Validity

[15] What type of dataset you have used in your MSR research study?
    Solely Open Source Software (OSS): 54.29%
    Both (but mostly OSS): 22.86%
    Both (but mostly CSS): 5.71%
    Solely Closed/Proprietary Source Software (CSS/PSS): 8.57%
    Both (equally used): 8.57%

[16] Do you believe that the results/findings on OSS dataset can be generalized to CSS dataset?
    Never: 5.71%
    Rarely: 22.86%
    Sometimes: 71.43%
    Always: 0%

[17] Do you believe that within OSS dataset there is enough diversity for researchers to test for generalizability?
    Never: 6.25%
    Rarely: 25.00%
    Sometimes: 62.50%
    Always: 6.25%

[18] Why do you believe that threats to external validity exists?
    Lack of accessibility to CSS dataset: 75.00%
    Usage of few well-known and OSS dataset: 40.63%
    Others (please specify): 6.25%

[19] How can we improve external validity concerns in MSR research?
    Creating benchmark suite by the research community and sharing the analysis results on it: 62.50%
    Reviewers must discuss the validity concerns and evaluation criteria for paper selection: 31.25%
    Making the dataset publicly available: 81.25%
    Others (please specify): 3.13%

[20] Can you suggest ways to increase the contribution of studies using CSS/PSS dataset in MSR research?
    Industry - Academia collaboration should be more promoted: 65.71%
    Sharing of CSS/PSS dataset by anonymization: 54.29%
    Others (please specify): 11.43%


Figure 4: Extent of OSS and CSS Dataset Usage in MSR
Research Studies


2.2 Data Sharing and SE Data Repositories 

Sharing of Software Engineering data and creation of SE data
repositories for the purpose of conducting benchmarking,
experimental and empirical studies is critical in disciplines like
Empirical Software Engineering and Mining Software Repositories,
where the validity of the scientific results and conclusions is
highly dependent on the underlying dataset used for experiments.
There have been attempts to create public data repositories in the
Software Engineering field where researchers can upload real-world
project data. Cheikhi et al. present their analysis of the two
largest of the small number of publicly available software
engineering repositories: the ISBSG Repository, which contains
datasets covering a considerable number of fields, and the PROMISE
repository with its large number of different datasets [5].

Cukic et al. mention that the lack of publicly available SE datasets
results in poorly validated models. Furthermore, they mention that
dataset submission by organizations to public repositories is
challenging because the public release of any data that could link a
company with a negative image is a major deterrent for the company
towards sharing its data [6]. Fernandez-Diego et al. present an
analysis of the potential and limitations of the International
Software Benchmarking Standards Group (ISBSG) dataset. They study
how and to what extent ISBSG has been used by researchers since the
year 2000. Their analysis reveals that studies including datasets
from ISBSG were published in 19 Journals and 40 Conferences [8].

Table 4 reveals that 36.36% of the respondents always make their
dataset publicly available. It is interesting to note that all those
who share their dataset make it freely available and do not require
any sharing fee, dataset sharing agreement or license. Table 4 and
Figure 3 present respondents' opinions on whether dataset sharing
should be a mandatory condition for publication in the MSR
conference. Our survey reveals that 31.82% feel that it should be
mandatory while 59.09% are neutral. Questions 12 and 13 in Table 4
are questions in which the respondent can select multiple answers.
We observe that a personal home page (63.64%), rather than a public
repository, was the most common platform for making the data
available. It is interesting to note that project hosting websites
like GitHub (45.45%) and Bitbucket (9.09%) are more widely used than
well-known public repositories like PROMISE (4.55%) for sharing
datasets. All of the respondents who answered Question 13 want to
share their dataset to contribute to the replication of experiments;
68.18% also want to get more Empirical Software Engineering
researchers involved in Mining Software Repositories research and
45.45% want to allow meta-analysis.

2.3 Replication and Threats to External Validity 

Robles et al. conduct a study of 171 papers from six MSR conferences
(2004 to 2009) that contained any experimental analysis of software
projects, assessing their potential for being replicated. Their
findings show that MSR authors in general use publicly available
data sources (such as data from OSS projects like Google Android and
Chromium, Mozilla Firefox and Eclipse), mainly from free software
repositories, but that the amount of publicly available processed
datasets is very low. They also investigated the public availability
of tools and scripts created by authors and show that for a majority
of papers they were not able to find any tool, even for papers where
the authors explicitly state that they have built one [10].

Shull et al. mention that reproducibility and replication in
Empirical Software Engineering research are important and that the
two main goals of replication are to gain confidence in the results
of previous studies and to understand the scope of the results [13].
Barr et al. mention that Software Engineering research will advance
further and faster if the sharing of data and tools were easier and
more widespread. They discuss pragmatic concerns, such as the time
and effort required and the risk of being scooped, which hinder the
realization of this idea of data sharing. They examine the costs and
benefits of facilitating sharing in the field of Software
Engineering in an effort to help the community understand what
problems exist and find a solution [4].

Table 5 shows the responses of MSR authors on the usage of OSS/CSS
datasets and on the generalizability of experiments conducted solely
on OSS datasets. Answers to Question 15 in Table 5 show a lack of
empirical studies on closed or proprietary datasets (refer to Figure
4). Nearly 30% of the respondents believe that the results and
findings on OSS datasets can never or rarely be generalized to CSS
datasets. Similarly, nearly 30% of the respondents believe that
there is not enough (rarely or never) diversity within OSS datasets
for researchers to test for generalizability.
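
The two "nearly 30%" figures above are sums of the "Never" and
"Rarely" rows of Questions 16 and 17 in Table 5. A minimal sketch of
that arithmetic follows; the variable and function names are
illustrative only and are not part of the survey instrument, and the
percentages are copied from Table 5 as read from the extracted text.

```python
# Percentages copied from Table 5 (Questions 16 and 17).
# The names below are illustrative and not part of the survey.
q16_generalize_oss_to_css = {
    "Never": 5.71, "Rarely": 22.86, "Sometimes": 71.43, "Always": 0.0,
}
q17_diversity_within_oss = {
    "Never": 6.25, "Rarely": 25.00, "Sometimes": 62.50, "Always": 6.25,
}


def sceptical_share(distribution):
    """Combined percentage of 'Never' and 'Rarely' answers."""
    return round(distribution["Never"] + distribution["Rarely"], 2)


q16_sceptical = sceptical_share(q16_generalize_oss_to_css)
q17_sceptical = sceptical_share(q17_diversity_within_oss)
print(q16_sceptical, q17_sceptical)
```

Under these inputs the combined shares are 28.57% for Question 16 and
31.25% for Question 17, which is what the text rounds to "nearly
30%".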

Questions 18 to 20 in Table 5 are questions in which the respondent
can select multiple answers. 75% of the respondents believe that
lack of accessibility to CSS datasets is one of the major threats to
external validity. We asked MSR authors how external validity
concerns in MSR research can be improved and how the contribution of
studies using CSS/PSS datasets in MSR research can be increased.
Respondents mention that making the dataset publicly available
(81.25%) and creating a benchmark suite by the research community
and sharing the analysis results on it (62.50%) can improve external
validity concerns. Promoting Industry-Academia collaboration
(65.71%) and sharing CSS/PSS datasets after anonymization (54.29%)
are the suggested ways to increase the contribution of studies using
CSS/PSS datasets in MSR research.
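
Because respondents can select several options in the multiple-answer
questions, each percentage is the share of respondents who selected
that option, so the percentages for one question need not sum to
100%. The following is a minimal sketch of such a tally. The function
name and the four example answer sets are hypothetical and are not
the survey data; this is not necessarily how the reported figures
were computed.

```python
from collections import Counter


def multi_select_percentages(responses):
    """Return, per option, the percentage of respondents selecting it.

    `responses` is a list with one collection of selected options per
    respondent. An empty list yields an empty result.
    """
    total = len(responses)
    if total == 0:
        return {}
    # Count each option at most once per respondent.
    counts = Counter(
        option for selected in responses for option in set(selected)
    )
    return {
        option: round(100.0 * n / total, 2)
        for option, n in counts.items()
    }


# Hypothetical answers from four respondents (NOT the survey data).
example_responses = [
    {"Lack of accessibility to CSS dataset"},
    {"Lack of accessibility to CSS dataset",
     "Usage of few well-known OSS dataset"},
    {"Usage of few well-known OSS dataset"},
    {"Lack of accessibility to CSS dataset"},
]

shares = multi_select_percentages(example_responses)
print(shares)
print(sum(shares.values()))
```

In this made-up example three of four respondents select the first
option and two of four select the second, so the shares are 75% and
50% and add up to 125%.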

3. CONCLUSION 

We conduct a survey of authors of the past three years of the MSR
conference (2012, 2013 and 2014) to collect data on their views and
suggestions to address the stated concerns. Nearly 75% of the survey
respondents were affiliated to a University whereas only about 25%
of the respondents were from Industry. We received opinions from MS
and PhD Scholars in University, Faculty Members, and Researchers,
Software Engineers and Managers in Industry. Our survey reveals that
Industry sponsored PhD fellowships and encouraging student
internships are enablers for improving the partnership between
Industry and Academia. Our findings show that 31.82% of the
respondents feel that data sharing should be a mandatory condition
for publication while 59.09% are neutral. Almost all of the
respondents want to share their dataset with other researchers to
encourage replication, to allow meta-analysis and to get more
Empirical Software Engineering researchers involved in Mining
Software Repositories research. Respondents mention that making the
dataset publicly available and creating a benchmark suite by the
research community and sharing the analysis results on it can
improve external validity concerns. Promoting Industry-Academia
collaboration and sharing CSS/PSS datasets after anonymization are
the suggested ways to increase the contribution of studies using
CSS/PSS datasets in MSR research.